Practical Pandas 02: Exporatory Data Analysis

Welcome back. As a reminder, we've got a dataset with my cycling data from last year merged and stored in an HDF5 store. Today we'll use pandas, seaborn, and matplotlib to do some exploratory data analysis.


In [2]:
%matplotlib inline

import os
import datetime

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df = pd.read_hdf(os.path.join('data', 'cycle_store.h5'), key='merged')
df.head()


Out[3]:
Time Ride Time Ride Time (secs) Stopped Time Stopped Time (secs) Latitude Longitude Elevation (feet) Distance (miles) Speed (mph) Pace Pace (secs) Average Speed (mph) Average Pace Average Pace (secs) Ascent (feet) Descent (feet) Calories ride_id
0 2013-08-01 07:07:10 2014-08-25 00:00:01 1.1 2014-08-25 0 41.703753 -91.609892 963 0.00 2.88 2014-08-25 00:20:51 1251 0.00 2014-08-25 00:00:00 0 0 0 0 0
1 2013-08-01 07:07:17 2014-08-25 00:00:08 8.2 2014-08-25 0 41.703825 -91.609835 852 0.01 2.88 2014-08-25 00:20:51 1251 2.56 2014-08-25 00:23:27 1407 0 129 0 0
2 2013-08-01 07:07:22 2014-08-25 00:00:13 13.2 2014-08-25 0 41.703858 -91.609814 789 0.01 2.88 2014-08-25 00:20:51 1251 2.27 2014-08-25 00:26:27 1587 0 173 0 0
3 2013-08-01 07:07:27 2014-08-25 00:00:18 18.2 2014-08-25 0 41.703943 -91.610090 787 0.02 6.60 2014-08-25 00:09:06 546 4.70 2014-08-25 00:12:47 767 0 173 1 0
4 2013-08-01 07:07:40 2014-08-25 00:00:31 31.2 2014-08-25 0 41.704381 -91.610258 788 0.06 9.50 2014-08-25 00:06:19 379 6.37 2014-08-25 00:09:26 566 0 173 2 0

The columns should be pretty obvious. The pair (ride_id, Time) form a unique index for the DataFrame. We may set them to be the MulitIndex later. Pace is just $\frac{time}{distance}$.

Let's drop some of the redundent columns and do a bit of renaming.


In [4]:
df = df.drop(['Ride Time', 'Stopped Time', 'Pace', 'Average Pace'], axis=1)

def renamer(name):
    for char in ['(', ')']:
        name = name.replace(char, '')
    name = name.replace(' ', '_')
    name = name.lower()
    return name

df = df.rename(columns=renamer)

Timelines

Using a snazzy new feature coming out in pandas 0.15, we can easily access the time attributes of the the datetime columns under the .dt namespace.


In [5]:
sub = df[['time', 'distance_miles', 'ride_id']].copy()
sub['time'] = sub['time'].dt.time
sub = sub.sort(columns='time')

Let's break them into two parts, morning and afternoon.


In [6]:
morning = sub[sub.time < datetime.time(12)]
evening = sub[sub.time >= datetime.time(12)]

In [7]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)


Out[7]:
(0, 8)

In [8]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)


Out[8]:
(0, 8)

Fun. The horizontal distance is the length of time it took me to make the ride. I like this chart because it also conveys the start time of each ride. The plot shows that the morning ride typically took longer, but we can verify that.


In [9]:
is_morning = df.time.dt.time < datetime.time(12, 0, 0)

ride_time = df.groupby(['ride_id', is_morning])['ride_time_secs'].agg('max')
mean_time = ride_time.groupby(level=1).mean().rename(
    index={True: 'morning', False: 'evening'})
mean_time / 60


Out[9]:
evening    30.761667
morning    29.362716
Name: ride_time_secs, dtype: float64

So the morning ride is typically shorter! But I think I know what's going on. We were misleading with our plot earlier since the range of the horizontal axis weren't identical. Always check the axis!


In [10]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(6, 40), datetime.time(10, 40))
ax.set_ylim(0, 8)


Out[10]:
(0, 8)

In [11]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(15, 30), datetime.time(19, 30))
ax.set_ylim(0, 8)


Out[11]:
(0, 8)

In [12]:
axes = df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet', color='k', alpha=.5)



In [13]:
df['is_morning'] = df.time.dt.time < datetime.time(12)

In [14]:
sns.tsplot(df, time="distance_miles", unit="ride_id", value="elevation_feet")


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-34c2210b4fa5> in <module>()
----> 1 sns.tsplot(df, time="distance_miles", unit="ride_id", value="elevation_feet")

/Users/tom/Envs/py3/lib/python3.4/site-packages/seaborn-0.4.dev-py3.4.egg/seaborn/timeseries.py in tsplot(data, time, unit, condition, value, err_style, ci, interpolate, color, estimator, n_boot, err_palette, err_kws, legend, ax, **kwargs)
    201     for c, (cond, df_c) in enumerate(data.groupby(condition, sort=False)):
    202 
--> 203         df_c = df_c.pivot(unit, time, value)
    204         x = df_c.columns.values.astype(np.float)
    205 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/frame.py in pivot(self, index, columns, values)
   3273         """
   3274         from pandas.core.reshape import pivot
-> 3275         return pivot(self, index=index, columns=columns, values=values)
   3276 
   3277     def stack(self, level=-1, dropna=True):

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in pivot(self, index, columns, values)
    358                          index=MultiIndex.from_arrays([self[index],
    359                                                        self[columns]]))
--> 360         return indexed.unstack(columns)
    361 
    362 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/series.py in unstack(self, level)
   1879         """
   1880         from pandas.core.reshape import unstack
-> 1881         return unstack(self, level)
   1882 
   1883     #----------------------------------------------------------------------

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in unstack(obj, level)
    432             return obj.T.stack(dropna=False)
    433     else:
--> 434         unstacker = _Unstacker(obj.values, obj.index, level=level)
    435         return unstacker.get_result()
    436 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in __init__(self, values, index, level, value_columns)
     96 
     97         self._make_sorted_values_labels()
---> 98         self._make_selectors()
     99 
    100     def _make_sorted_values_labels(self):

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in _make_selectors(self)
    138 
    139         if mask.sum() < len(self.index):
--> 140             raise ValueError('Index contains duplicate entries, '
    141                              'cannot reshape')
    142 

ValueError: Index contains duplicate entries, cannot reshape

In [ ]:
df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet')

In [ ]:
sns.tsplot(df, time="ride_time_secs", value="distance_miles", condition="is_morning", unit="ride_id")

In [15]:



Out[15]:
time ride_time_secs stopped_time_secs latitude longitude elevation_feet distance_miles speed_mph pace_secs average_speed_mph average_pace_secs ascent_feet descent_feet calories ride_id is_morning
0 2013-08-01 07:07:10 1.1 0.0 41.703753 -91.609892 963 0.00 2.88 1251 0.00 0 0 0 0 0 True
1 2013-08-01 07:07:17 8.2 0.0 41.703825 -91.609835 852 0.01 2.88 1251 2.56 1407 0 129 0 0 True
2 2013-08-01 07:07:22 13.2 0.0 41.703858 -91.609814 789 0.01 2.88 1251 2.27 1587 0 173 0 0 True
3 2013-08-01 07:07:27 18.2 0.0 41.703943 -91.610090 787 0.02 6.60 546 4.70 767 0 173 1 0 True
4 2013-08-01 07:07:40 31.2 0.0 41.704381 -91.610258 788 0.06 9.50 379 6.37 566 0 173 2 0 True
5 2013-08-01 07:07:42 33.2 0.0 41.704459 -91.610286 792 0.06 9.67 372 6.59 547 0 173 3 0 True
6 2013-08-01 07:07:54 45.2 0.0 41.704936 -91.610316 778 0.09 10.23 352 7.46 482 0 173 4 0 True
7 2013-08-01 07:08:01 52.2 0.0 41.705232 -91.610316 791 0.11 10.68 337 7.87 457 0 173 5 0 True
8 2013-08-01 07:08:03 54.2 0.0 41.705337 -91.610248 799 0.12 11.39 316 8.11 444 0 173 6 0 True
9 2013-08-01 07:08:14 65.2 0.0 41.705797 -91.610270 779 0.15 10.57 341 8.51 423 0 173 7 0 True
10 2013-08-01 07:08:24 75.2 0.0 41.706280 -91.610187 790 0.19 10.24 352 8.98 401 0 173 10 0 True
11 2013-08-01 07:08:34 85.2 0.0 41.706742 -91.610131 793 0.22 11.56 311 9.28 388 0 173 12 0 True
12 2013-08-01 07:08:43 94.2 0.0 41.707208 -91.610148 799 0.25 12.68 284 9.62 374 0 173 15 0 True
13 2013-08-01 07:08:50 101.6 0.0 41.707687 -91.610108 792 0.28 15.81 228 10.09 357 0 173 17 0 True
14 2013-08-01 07:08:57 108.2 0.0 41.708137 -91.610112 795 0.32 16.42 219 10.51 343 0 173 18 0 True
15 2013-08-01 07:09:00 111.4 0.0 41.708287 -91.610094 799 0.33 15.21 237 10.55 341 0 173 20 0 True
16 2013-08-01 07:09:02 113.3 0.0 41.708368 -91.609948 800 0.34 15.14 238 10.67 337 0 173 20 0 True
17 2013-08-01 07:09:11 122.3 0.0 41.708453 -91.609323 810 0.37 15.24 236 10.85 332 0 173 24 0 True
18 2013-08-01 07:09:19 130.3 0.0 41.708503 -91.608707 816 0.40 12.86 280 11.07 325 0 173 28 0 True
19 2013-08-01 07:09:23 134.3 0.0 41.708511 -91.608398 819 0.42 12.05 299 11.17 322 0 173 29 0 True
20 2013-08-01 07:09:43 154.3 0.0 41.708425 -91.607769 834 0.45 10.29 350 10.49 343 41 173 36 0 True
21 2013-08-01 07:09:51 162.3 0.0 41.708439 -91.607102 845 0.48 14.85 242 10.74 335 41 173 40 0 True
22 2013-08-01 07:09:59 170.9 0.0 41.708440 -91.606433 839 0.52 15.58 231 10.93 329 41 173 40 0 True
23 2013-08-01 07:10:06 177.4 0.0 41.708439 -91.605821 820 0.55 16.04 224 11.17 322 41 173 41 0 True
24 2013-08-01 07:10:13 184.7 0.0 41.708440 -91.605220 816 0.58 16.04 224 11.33 318 41 173 41 0 True
25 2013-08-01 07:10:21 192.3 0.0 41.708515 -91.604569 809 0.62 15.34 235 11.52 312 41 184 42 0 True
26 2013-08-01 07:10:28 199.2 0.0 41.708662 -91.603935 800 0.65 17.13 210 11.74 307 41 200 42 0 True
27 2013-08-01 07:10:35 206.2 0.0 41.708785 -91.603353 796 0.68 16.71 215 11.89 303 41 209 43 0 True
28 2013-08-01 07:10:41 212.3 0.0 41.708857 -91.602961 799 0.70 13.62 264 11.91 302 41 209 43 0 True
29 2013-08-01 07:10:51 222.3 0.0 41.708393 -91.602801 806 0.74 12.03 299 11.90 302 41 209 47 0 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15961 2013-10-10 07:29:02 1463.3 185.0 41.661385 -91.542541 755 6.47 10.47 344 15.92 226 203 328 306 44 True
15962 2013-10-10 07:29:03 1464.2 185.0 41.661333 -91.542399 748 6.48 13.38 269 15.93 226 203 328 306 44 True
15963 2013-10-10 07:29:05 1466.2 185.0 41.661334 -91.542143 735 6.49 17.82 202 15.94 226 203 328 306 44 True
15964 2013-10-10 07:29:06 1467.3 185.0 41.661261 -91.542036 729 6.50 19.88 181 15.95 226 203 339 306 44 True
15965 2013-10-10 07:29:11 1472.3 185.0 41.661266 -91.541427 703 6.53 23.86 151 15.97 225 203 377 306 44 True
15966 2013-10-10 07:29:16 1477.2 185.0 41.661271 -91.540820 692 6.56 23.10 156 16.00 225 203 377 306 44 True
15967 2013-10-10 07:29:21 1482.4 185.0 41.661274 -91.540171 682 6.60 23.27 155 16.02 225 203 377 307 44 True
15968 2013-10-10 07:29:26 1487.3 185.0 41.661277 -91.539543 674 6.63 23.72 152 16.05 224 203 405 307 44 True
15969 2013-10-10 07:29:32 1493.5 185.0 41.661281 -91.538852 677 6.67 22.03 163 16.07 224 203 410 309 44 True
15970 2013-10-10 07:29:41 1502.2 185.0 41.661285 -91.538190 682 6.70 12.29 293 16.06 224 203 410 312 44 True
15971 2013-10-10 07:29:51 1512.2 185.0 41.661295 -91.537755 679 6.72 8.48 424 16.01 225 203 410 313 44 True
15972 2013-10-10 07:29:54 1515.3 185.0 41.661505 -91.537672 678 6.74 12.59 286 16.01 225 203 410 313 44 True
15973 2013-10-10 07:30:01 1522.3 185.0 41.661987 -91.537653 677 6.77 17.34 208 16.01 225 203 410 314 44 True
15974 2013-10-10 07:30:02 1523.3 185.0 41.662019 -91.537652 676 6.77 15.92 226 16.01 225 203 410 314 44 True
15975 2013-10-10 07:30:03 1524.3 185.0 41.662041 -91.537593 676 6.78 14.54 248 16.01 225 203 410 314 44 True
15976 2013-10-10 07:30:04 1525.2 185.0 41.662037 -91.537666 676 6.78 14.54 248 16.00 225 203 410 314 44 True
15977 2013-10-10 07:30:05 1526.2 185.0 41.662106 -91.537775 676 6.79 16.14 223 16.01 225 203 410 314 44 True
15978 2013-10-10 07:30:06 1527.2 185.0 41.662124 -91.537731 677 6.79 14.84 243 16.01 225 203 410 314 44 True
15979 2013-10-10 07:30:07 1528.2 185.0 41.662192 -91.537682 678 6.80 14.90 242 16.01 225 203 410 314 44 True
15980 2013-10-10 07:30:10 1531.4 185.0 41.662320 -91.537746 682 6.81 13.33 270 16.00 225 203 410 315 44 True
15981 2013-10-10 07:30:13 1534.4 185.0 41.662452 -91.537677 684 6.82 13.77 261 15.99 225 203 410 316 44 True
15982 2013-10-10 07:30:15 1536.2 185.0 41.662497 -91.537527 685 6.82 14.14 255 15.99 225 203 410 316 44 True
15983 2013-10-10 07:30:31 1552.3 185.0 41.662616 -91.536922 692 6.86 6.70 537 15.90 226 203 410 319 44 True
15984 2013-10-10 07:30:53 1574.3 185.0 41.662583 -91.536315 707 6.89 5.50 654 15.75 229 231 410 321 44 True
15985 2013-10-10 07:30:57 1578.2 185.0 41.662569 -91.536220 710 6.89 5.00 721 15.72 229 240 410 321 44 True
15986 2013-10-10 07:31:00 1581.3 185.0 41.662641 -91.535916 712 6.91 9.28 388 15.73 229 240 410 322 44 True
15987 2013-10-10 07:31:15 1588.2 193.1 41.662581 -91.535310 722 6.94 7.01 514 15.73 229 240 410 323 44 True
15988 2013-10-10 07:31:29 1588.2 207.1 41.662550 -91.535287 722 6.94 0.98 3677 15.74 229 240 410 323 44 True
15989 2013-10-10 07:31:50 1588.2 227.6 41.662586 -91.535258 722 6.95 0.86 4174 15.74 229 240 410 323 44 True
15990 2013-10-10 07:31:52 1588.2 230.0 41.662586 -91.535258 722 6.95 0.51 7074 15.74 229 240 410 323 44 True

15991 rows × 16 columns


In [ ]:


In [ ]:


In [ ]: